Accelerating incremental checkpointing for extreme-scale computing

Authors

Kurt B. Ferreira

Rolf Riesen

Patrick G. Bridges

Dorian C. Arnold

Ron Brightwell

Abstract

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overhead and higher efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems.
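The hash-based change detection mentioned in the abstract can be illustrated with a minimal sketch. This is not libhashckpt's API: the function names, the 4096-byte `PAGE_SIZE`, and the use of SHA-256 from Python's `hashlib` on the CPU are all assumptions made for illustration. The sketch omits the page-protection half of the hybrid scheme and the GPU offload, and it assumes the buffer does not shrink between checkpoints.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page granularity, chosen only for illustration


def page_hashes(buf, page_size=PAGE_SIZE):
    """Hash each fixed-size page of buf (CPU stand-in for GPU hashing)."""
    return [hashlib.sha256(buf[i:i + page_size]).digest()
            for i in range(0, len(buf), page_size)]


def incremental_checkpoint(buf, prev_hashes, page_size=PAGE_SIZE):
    """Return (delta, new_hashes).

    delta maps page index -> page bytes for every page whose hash differs
    from the previous checkpoint (or that did not exist before).
    """
    new_hashes = page_hashes(buf, page_size)
    delta = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            start = idx * page_size
            delta[idx] = bytes(buf[start:start + page_size])
    return delta, new_hashes


def restore(base, delta, page_size=PAGE_SIZE):
    """Apply a delta to a mutable copy of an earlier state."""
    for idx, page in delta.items():
        start = idx * page_size
        base[start:start + len(page)] = page
    return base


# Usage: the first checkpoint saves every page, the second only the dirty one.
state = bytearray(4 * PAGE_SIZE)
full, hashes = incremental_checkpoint(state, [])
snapshot = bytearray(state)

state[PAGE_SIZE + 10] = 0xFF  # modify a single byte in page 1
delta, hashes = incremental_checkpoint(state, hashes)
recovered = restore(bytearray(snapshot), delta)
```

In this sketch a page that is rewritten with identical contents produces the same hash and is therefore left out of the delta, which is the kind of saving that content hashing can offer over tracking writes alone.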


Similar resources

An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment

Mobile computing systems are made up of different components among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environment. In the scheme suggested nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...


Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid Computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling is an important issue for large scale computational grid because of its unreliable nature of grid resources. Commonly exploited techniques to realize fault tolerance is periodic Checkpointing that periodically ...


Speculative Checkpointing

In large scale parallel systems, storing memory images with checkpointing will involve massive amounts of concentrated I/O from many nodes, resulting in considerable execution overhead. For user-level checkpointing, overhead reduction usually involves both spatial, i.e., reducing the amount of checkpoint data, and temporal, i.e., spreading out I/O by checkpointing data as soon as their values b...


A Case Study of Incremental and Background Hybrid In-Memory Checkpointing

Future exascale computing systems will have high failure rates due to the sheer number of components present in the system. A classic fault-tolerance technique used in today’s supercomputers is a checkpoint-restart mechanism. However, traditional hard disk-based checkpointing techniques will soon hit the scalability wall. Recently, many emerging non-volatile memory technologies, such as Phase-C...


libhashckpt: Hash-Based Incremental Checkpointing Using GPU's

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint str...



Journal:

Future Generation Comp. Syst.

Volume 30

Publication date: 2014
